Digital radiography image quality evaluation using various phantoms and software

Abstract
Purpose: To investigate the effect of the exposure parameters on image quality (IQ) metrics of phantom images, obtained automatically using software or from visual evaluation.
Methods: Three commercial phantoms and a homemade phantom constructed according to the instructions given in the IAEA Human Health Series No. 39 publication were used, along with the respective software that automatically estimates various IQ metrics. Images with various exposure parameters were acquired in a digital radiography (DR) unit. For the commercial phantoms, visual evaluations were also performed. The IQ scores obtained were analyzed to investigate the effects of increasing incident air kerma (IAK), tube potential (kVp), additional filtration, and acquisition protocol on IQ.
Results: The effects of the exposure parameters on the IQ metrics, determined with the commercial and the IAEA phantoms, were not the same. For example, clear trends of improvement of IQ scores with increased IAK and reduction of most IQ scores with increased kVp were observed mostly with the IAEA phantom, but not with the commercial phantoms (for both automatic and visual scoring methods). For all phantoms, the maximum variations in IQ scores observed for repeated identical exposures were almost always below 10% with automatic evaluation, whereas for visual evaluation they reached 17%.
Conclusions: Failure to detect some expected trends with the complex commercial phantoms may be attributed to the fact that IQ in DR is more strongly affected by the post-processing procedures, which may mask the effect of other parameters on IQ, something that was not observed with the simple IAEA phantom.

radiography, the IQ of any acquired image could not be improved, as the optical density, contrast, and latitude of any image could not be modified after film development and fixing. The only possibility to enhance the visibility of various anatomical structures in an X-ray film was to adjust viewing conditions (film viewer and ambient light illumination levels). 1,5 With DR, images can be processed in various ways to adjust and enhance the spatial resolution and contrast, and different post-processing protocols applied to the same raw image may result in different IQ levels.
For evaluation of IQ of DR images, physical phantoms like those used in SF imaging are still used. These phantoms contain various structures, and the ability of a human observer to detect these structures is used to quantify IQ. Various phantoms are commercially available, which commonly contain structures to assess spatial resolution (also referred to as high contrast resolution, HCR), low contrast (LC), and image latitude (which in DR is usually referred to as dynamic contrast). IQ metrics, such as signal-to-noise ratio (SNR) and contrast-to-noise ratio, which is also referred to as signal difference-to-noise ratio (SDNR), are also used to quantitatively assess IQ. 6 However, the correlation of the previous metrics with clinical IQ is not straightforward and has been recently disputed. 7 Recent literature suggests that IQ is best evaluated using modulation transfer function (MTF) for the evaluation of spatial resolution and detectability index (d′) for the evaluation of LC resolution (d′ depends on noise power spectrum and task-transfer function). 7 However, performance levels for acceptance or commissioning purposes, as well as for conformance to national or international norms, are still given in terms of lp/mm for HCR and in terms of visual detection of LC structures that have a specific nominal contrast difference with respect to the background. 5,8-10 Therefore, conformance must be confirmed using phantoms that can produce these metrics.
In DR, evaluation of IQ can be done visually as with SF images. Unfortunately, due to intra- and interobserver variability, scoring procedures based on visual observation are often subjective. 7 To overcome this problem, some commercial phantoms are now accompanied by software for automatic evaluation of IQ, to ensure that a given DR image will always be scored in exactly the same way, irrespective of how many times the evaluation is repeated. However, the cost of these phantoms and the accompanying software is considerable. Thus, medical physics departments usually buy only one phantom, and less often the corresponding software, for performing the required IQ tests in all radiography units under their supervision. This means that daily or weekly IQ tests, especially for distant facilities, are practically impossible.
To facilitate IQ testing in DR (and allow for an increase in its frequency), IAEA recently developed a remote and automated solution using a simple, inexpensive phantom, and free software. 11 The IAEA publication is accompanied by supplementary material to support the remote/automated QC process, including (1) a real-size blueprint of the proposed phantom allowing the user to accurately manufacture it, (2) dedicated free software called Automated Tool for Image Analysis (ATIA) to automatically analyze the phantom images and provide advanced and sophisticated metrics of IQ (exported in Excel file format), (3) an MS Excel file into which the IQ results from each test can easily be entered, for proper documentation of the results and creation of long-term performance charts for each radiographic facility, and (4) a user manual for the ATIA software. 12 The IAEA methodology was tested in a pilot study in various clinical scenarios, and initial results showed that the phantom is easily fabricated and enables QC tests to be performed even on a daily or weekly basis, as the software allows for a complete and automated evaluation of the principal performance characteristics of the imaging chain. 13 To test the IAEA methodology in a wide range of clinical scenarios, the IAEA launched in 2021 a Coordinated Research Project (CRP) entitled "Advanced Tools for Quality and Dosimetry of Digital Imaging in Radiology" (E24025). 14
The main objective of this work was to evaluate IQ metrics obtained using commercially available solutions (phantoms and software) and the IAEA radiographic phantom solution. Apart from evaluating similarities and/or differences among phantoms in terms of IQ metrics, the most important question to be answered is whether these phantoms and the related software are sensitive enough to detect subtle changes in IQ. These changes may occur during QC tests performed on different dates for the same radiography unit, due to fluctuations in the exposure factors typically used to acquire the phantom images, as a result of a malfunction or a change in the system's adjustments.

MATERIALS AND METHODS
Four phantoms were used in this study (Table 1 and Figure 1), including three commercial (Leeds TOR CDR, Leeds PIX-13, and IBA Primus A) and one henceforth referred to as the IAEA phantom, locally manufactured (made of PMMA, Cu, and Al) according to IAEA specifications and guidance described in IAEA CRP E24025 project-related documentation. 11,13 The characteristics of the four phantoms and the corresponding IQ metrics calculated automatically from each phantom (using the respective software) are shown in Figures 2-5, for CDR, PIX-13, Primus, and IAEA phantom, respectively, and summarized in Table 1. The X-ray unit used to acquire the phantom images was a Luminos dRF Max (Siemens Healthineers, Munich, Germany), which has a digital flat panel detector for both fluoroscopy and radiography. This unit had been subjected to an extensive QC procedure prior to the following experiments, and all parameters, including kVp accuracy, and kVp, incident air kerma (IAK), and Automatic Exposure Control (AEC) system repeatability, were well within the adopted performance limits.
To investigate the effect of the exposure factors on the IQ scores, the four phantoms were sequentially positioned on the radiographic table, and the following acquisitions were made:
1. Repeated acquisitions using the basic acquisition protocol as follows: abdomen examination protocol, tube potential of 70 kV, AEC adjusted for an IAK of 2.5 µGy on the image receptor, central AEC chamber activated, no additional filtration, and the grid in place. The choice of 70 kV is due to the relatively small thickness of the three commercial phantoms and to the fact that the nominal contrast values for CDR and PIX-13 are reported for 70 kV, and for Primus for 75 kV. For the IAEA phantom, being the most attenuating of all phantoms due to the 2 mm Cu attenuator, the tube potential of the basic protocol was set at 81 kV as suggested by IAEA. 11,13 Any observed variation in IQ scores of the DR images resulting from repeated acquisitions should mainly be attributed to Poisson statistics, and to normal minor variations occurring in exposure factors during repeated exposures under AEC with this radiographic system.
2. Acquisitions with all four phantoms using the basic Abdomen protocol, except that IAK on the image receptor was varied (using the AEC dose level corrections −2, −1, +1, and +2 and direct IAK adjustment for AEC at 1.25 and 5 µGy). The objective of this experiment was to investigate whether IQ scores increase with IAK.
3. Acquisitions using the basic Abdomen protocol, except that kV was varied. In addition to 70 or 81 kV used, respectively, for the basic protocol for the three commercial phantoms and the IAEA phantom, other kVps in the range of 50-125 kV were selected (depending on the phantom thickness), to investigate whether IQ scores decrease with increasing kVp.
4. Acquisitions using the basic Abdomen protocol, except that additional filtrations of 0.1-, 0.2-, and 0.3-mm Cu were used, to investigate whether IQ decreases with the use of harder beam qualities.
5. For CDR and Primus phantoms only, different images were produced from the same acquisition (reprocessing) using all the available selections of Diamond View (a post-processing algorithm used in Siemens DR systems for optimizing IQ of different anatomic regions). For the IAEA phantom, various examination protocols were used to acquire different images as well. The aim was to investigate the effect of different protocols (i.e., post-processing algorithms) on IQ scores.

Table 1 notes. Abbreviations: ATIA, Automated Tool for Image Analysis; MTF, modulation transfer function; SDNR, signal difference-to-noise ratio; SNR, signal-to-noise ratio.
a PV: pixel value; PVSD: pixel value standard deviation. As all 10 details were visible in all images, this metric was not considered.
b The MTF20% was calculated by interpolation.
c The number of visible large low-contrast details in the background, the number of visible step-wedge steps and the number of visible small details within the step wedge cannot be determined by the software. The reported SNR (%) values are equal to the ratios of PVSD and PV of regions of interest (ROIs) positioned in the background area of the phantom multiplied by 100. Therefore, they are noise-to-signal ratio values (NSR) rather than SNR.
d There are no circular details in the phantom. The detectability index (d′) is estimated from the noise power spectrum (NPS) and task-transfer function (TTF) for a specific object task, which in this case is the detection of circular details of the given diameters. 7 The SNR is calculated as the ratio of the PV and the PVSD of the ROI in the background area of the image (the red ROI of Figure 5) and the SDNR as the difference of the PVs of the ROIs in the aluminum (the green ROI in Figure 5) and the background ROI, divided by the PVSD of the background ROI.
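The SNR, SDNR, and NSR definitions given in the Table 1 notes can be illustrated with a minimal Python sketch. The function names, the use of the population standard deviation, the reporting of SDNR as a magnitude, and the pixel values are assumptions made for illustration; they are not taken from ATIA or from the commercial software.

```python
from statistics import mean, pstdev


def roi_stats(pixels):
    """Return (PV, PVSD): mean pixel value and its standard deviation in an ROI.

    Assumption: the population standard deviation is used.
    """
    return mean(pixels), pstdev(pixels)


def snr(background):
    """SNR = PV / PVSD of the background ROI."""
    pv, pvsd = roi_stats(background)
    return pv / pvsd


def nsr_percent(background):
    """Noise-to-signal ratio in percent: 100 * PVSD / PV of the background ROI."""
    pv, pvsd = roi_stats(background)
    return 100.0 * pvsd / pv


def sdnr(detail, background):
    """SDNR = |PV_detail - PV_background| / PVSD_background.

    Assumption: the magnitude is reported, so the result does not depend
    on the display polarity of the image.
    """
    pv_bg, pvsd_bg = roi_stats(background)
    pv_detail, _ = roi_stats(detail)
    return abs(pv_detail - pv_bg) / pvsd_bg


# Hypothetical pixel values, for illustration only.
background_roi = [98, 102, 100, 100]
aluminum_roi = [90, 90, 90, 90]
print(round(snr(background_roi), 2))
print(round(nsr_percent(background_roi), 2))
print(round(sdnr(aluminum_roi, background_roi), 2))
```

Because NSR (%) is simply 100 divided by the SNR, a value reported as "SNR (%)" that falls when noise falls is an NSR in the sense of note c.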
Regarding visual evaluation, three human observers, each with more than 5 years of experience in IQ scoring using QC phantoms, independently scored the radiographic images of the three commercial phantoms. The observers were blinded to the exposure factors and conditions used to acquire the images. The average value of the three readings was taken as the final score. The interobserver agreement was assessed using Cohen's kappa statistic.
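As an illustration of how pairwise interobserver agreement can be computed, the following sketch implements the unweighted Cohen's kappa for each pair of observers. Whether the study used the unweighted or a weighted form is not stated, so the unweighted form is an assumption, and the scores shown are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(scores_a, scores_b):
    """Unweighted Cohen's kappa for two observers scoring the same images."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("both observers must score the same non-empty image set")
    n = len(scores_a)
    # Observed agreement: fraction of images given identical scores.
    p_o = sum(a == b for a, b in zip(scores_a, scores_b)) / n
    # Chance agreement from each observer's marginal score frequencies.
    count_a, count_b = Counter(scores_a), Counter(scores_b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / n**2
    if p_e == 1.0:
        return 1.0  # both observers always gave the same single score
    return (p_o - p_e) / (1.0 - p_e)


def pairwise_kappa(observers):
    """Return {(name1, name2): kappa} for every pair of observers."""
    return {
        (a, b): cohen_kappa(observers[a], observers[b])
        for a, b in combinations(sorted(observers), 2)
    }


# Hypothetical scores (number of visible details) for six images.
scores = {
    "A": [5, 6, 6, 7, 5, 6],
    "B": [5, 6, 7, 7, 5, 5],
    "C": [5, 6, 6, 7, 5, 6],
}
for pair, kappa in pairwise_kappa(scores).items():
    print(pair, round(kappa, 2))
```

With three observers, this yields three kappa values (A-B, A-C, B-C) per IQ metric and phantom.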

RESULTS
The main results of this study are tabulated in Table 2 (blocks 1-5).

FIGURE 7
The effect of the incident air kerma (IAK) increase on automatic and visual scores related to low contrast is shown for the PIX-13 phantom.

FIGURE 8
The effect of the incident air kerma (IAK) increase on automatic (a) and visual scores (b) related to low contrast is shown for the Primus phantom.

FIGURE 9
The effect of the incident air kerma (IAK) increase on signal-to-noise ratio (SNR) (a) and signal difference-to-noise ratio (SDNR) (b) is shown for the IAEA phantom.

visual evaluation) exhibited moderate-to-strong trends of reduction with kVp increase, whereas for the automatic evaluation of low-contrast sensitivity, the negative trend was weak. Moderate trends of reduction of low-contrast sensitivity with kVp increase were also observed for PIX-13 (automatic and visual evaluation). For Primus, the SNR values exhibited a slight logarithmic increase with increasing kVp (the increase was numerically very small), whereas a strong trend of linear increase in the number of discernible dark and light steps with kVp increase was also observed. The latter may be attributed to the reduction of radiation contrast that occurs with the increase of kVp, which may increase the range of different thicknesses that can be accommodated with any given post-processing protocol.
• For the IAEA phantom, a linear decrease was observed for SDNR and detectability with increasing kVp. In contrast, for SNR, an increase with increasing kVp was observed (as observed with IAK increase). This is indeed notable, as SNR seems to follow a different trend than SDNR and detectability.
4. Regarding the effect of increasing additional filtration on the IQ metrics, the following observations were made:
• The increase of additional filtration produced small differences, but the trends in most cases ranged from very weak to moderate. The exception was the SNR for Primus, where a clear trend of logarithmic increase with increasing kVp was observed, but the increase was numerically very small (from 3.14 to 3.35).
• For IAEA, a clear negative trend with increased filtration was observed only for SDNR.
5. Regarding the effect of post-processing algorithm, as numerical values of trend cannot be defined, results are based on variations of IQ metrics on the graphs shown in Figure 10a-c for CDR, Figure 11a-c for Primus, and Figure 12a-c for the IAEA phantom. Thus, in Table 2, the "+" symbol denotes that a variation larger than 10% was observed, whereas the "ø" symbol denotes no variation (<10%). As can be seen in Figure 10a-c, some Diamond View settings are repeated once or twice. In the repetitions, a different post-processing algorithm (usually attached to a respective examination protocol) setting was also used. Table 3 summarizes the effect of changing the post-processing algorithm setting, and the symbols "+" and "ø" are used to denote variation (>10%) or not, in any of the IQ scores.
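The "+" and "ø" labelling with the 10% threshold can be sketched as follows. How the percentage variation was computed is not specified in the text; this sketch assumes the spread (maximum minus minimum) relative to the mean of the scores, and the input values are illustrative only.

```python
def relative_variation_percent(values):
    """Spread of the scores, (max - min), as a percentage of their mean.

    Assumption: this definition of "variation" is chosen for illustration;
    the text only states the 10% threshold.
    """
    if not values:
        raise ValueError("at least one score is required")
    avg = sum(values) / len(values)
    if avg == 0:
        raise ValueError("mean score is zero; relative variation is undefined")
    return 100.0 * (max(values) - min(values)) / abs(avg)


def variation_symbol(values, threshold_percent=10.0):
    """Return "+" if the variation exceeds the threshold, otherwise "ø"."""
    return "+" if relative_variation_percent(values) > threshold_percent else "ø"


# Illustrative scores of one IQ metric across post-processing settings.
print(variation_symbol([14, 14, 17]))
print(variation_symbol([3.14, 3.35]))
```

A different definition, such as the maximum deviation from a reference setting, would change which metrics cross the threshold, so the definition should be fixed before results are compared.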
As seen in Figure 10a, for the CDR phantom, the automatically calculated MTF20% values presented intense variations with varying Diamond View settings. Differences were observed even between images with the same Diamond View setting and a different examination protocol setting (see also Table 3). On the contrary, variations in visual evaluation of spatial resolution were relatively small and did not follow the variations of MTF20%. As seen in Figure 10b, for low-contrast sensitivity, variations between Diamond View settings were indeed observed in both automatic and visual evaluations and did not seem to follow a common pattern. However, variations in the automatically calculated values did not appear when changing the examination protocol setting (except for the "18 Abdomen" protocol). Finally, as seen in Figure 10c, for the detection of small sized/high contrast details, a value of 14 was reported by the software for all images, except for three protocols where 17 details were detected. It is notable, however, that in two repetitions with the 04 Thorax LAT Diamond View setting, the numbers of automatically detected low-contrast details were 14 (Abdomen examination protocol) and 17 (Chest PA examination protocol). The respective visual evaluations presented small variations with different post-processing algorithm settings.
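The MTF20% metric is the spatial frequency at which the MTF falls to 20%, obtained by interpolation between sampled points. A minimal sketch, assuming linear interpolation at the first downward crossing (the interpolation scheme used by the software is not stated) and a hypothetical MTF curve:

```python
def mtf_frequency_at(freqs, mtf, level=0.2):
    """Frequency at which the MTF first drops below `level`.

    Assumption: linear interpolation between the two samples that bracket
    the crossing. Returns None if the sampled curve never drops below `level`.
    """
    if len(freqs) != len(mtf):
        raise ValueError("freqs and mtf must have the same length")
    for i in range(len(freqs) - 1):
        f0, m0 = freqs[i], mtf[i]
        f1, m1 = freqs[i + 1], mtf[i + 1]
        if m0 >= level > m1:
            # Linear interpolation inside the bracketing interval.
            return f0 + (m0 - level) * (f1 - f0) / (m0 - m1)
    return None


# Hypothetical MTF samples (frequency in lp/mm, MTF normalized to 1 at zero).
frequencies = [0.0, 1.0, 2.0, 3.0]
mtf_values = [1.0, 0.6, 0.3, 0.1]
print(mtf_frequency_at(frequencies, mtf_values))
```

Because edge-enhancing post-processing reshapes the MTF curve, the crossing frequency can shift strongly between settings even when the detector is unchanged.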
For Primus, a small number of Diamond View settings were tested. As seen in Figure 11a,b, with different post-processing protocols, the automatically calculated IQ metrics varied more strongly than those determined visually. However, in Figure 11c, it can be seen that the number of visually discerned dark and light steps also varied strongly with post-processing protocol changes. This verified that the dynamic range is affected by the selection of the post-processing algorithm, which was rather expected.
For the IAEA phantom, where a few images were acquired using different examination protocols instead of reprocessing the same image, as seen in Figure 12a-c, the different protocols affected all automatically calculated IQ metrics. The horizontal and vertical MTF20%, though different, followed a similar variation pattern. On the contrary, variations of SNR and SDNR did not always follow the same pattern. The largest SNR value was observed for the Knee AP protocol, for which the smallest value of SDNR was observed; this is difficult to explain, but it is in line with the opposite trends noted before for SNR and SDNR with changing IAK and kVp. The detectability values estimated for small and large diameter details seemed to follow a similar pattern (except for the cervical spine 6-12 years protocol), though the small detail detectability was more strongly affected by protocol changes.
Finally, regarding the interobserver variability, the results of the Cohen's kappa statistics are presented in Table 4. It can be seen that except for one case, where the agreement was very good (PIX-13: HCR, observers B and C), and two cases, where the agreement was good (CDR: HCR, observers B and C; HC-0.5 mm, observers A and C), in all other cases, the agreement between observers ranged from poor to moderate.
The "+" denotes variation, whereas "ø" denotes no variation. Abbreviations: HCR, high contrast resolution; LC, low contrast; MTF, modulation transfer function. a Two values were observed (14 and 17), which cover the whole range of values observed with all the different post-processing protocols tested.

TABLE 4
Results of Cohen's kappa statistics between the image quality (IQ) scores of the three observers (A, B, and C) who performed the visual grading of the digital images

DISCUSSION
The results of the Cohen's kappa statistics, shown in Table 4, are indicative of the basic problems inherent in the visual evaluation of IQ. Intra- and interobserver variabilities are very well-known problems regarding visual evaluations. The same observer may score the same image differently if he/she repeats the evaluation, another day or many months later, depending on whether he/she is tired or not, focused or in a hurry, and so on. In some cases, different observers may score the same image differently because they may have different visual acuities and/or follow different rules to define when a detail is considered visible or not. For example, regarding the last detail that may be considered half-visible by two observers, one may decide to count it and the other not, and therefore scores will differ by 1. If the scoring of half-points is allowed, it may reduce the difference in scores to a half-point or zero. Though score differences may be reduced using strict rules, they cannot be eliminated, and therefore, intra- and interobserver variabilities are always expected. On the other hand, automatic objective evaluation of IQ is free of subjectivity, and results are expected to be always reproducible. However, IQ scores obtained using software may not be the same as the scores obtained from visual observation. As in the case of the CDMAM phantom, which has been used for many years for the evaluation of IQ in mammography, conversion factors may be required to convert software scores to visual observer scores. 15-17 For this reason, a direct comparison between software IQ scores and visual IQ scores was avoided. Furthermore, it should also be noted that in order to calculate an average score with the CDMAM phantom, it is required to acquire 8-16 images with the phantom slightly moved between acquisitions, because the relative position of the phantom's details (gold disks) and the dexels may affect the results. 
In the present study, to eliminate any variations that could be introduced by changes in position, each phantom was kept at the same position until all the images with different exposure conditions were acquired.
Currently, for the purposes of acceptance/commissioning and routine QC tests, the image is usually viewed on the monitor of the acquisition workstation, and the number of visible details is counted. The larger the number of visible details, the higher the score and the better the IQ. The number of visible details is compared with relevant limits where applicable or with reference values established during commissioning. As image formation is affected by Poisson statistics, when the IQ value is right on the limit, the Pass or Fail decision may change if the image acquisition is repeated. In this study, it was verified with all four phantoms that IQ may slightly change between subsequent acquisitions because of statistics.
One of the most important findings of this study was that for the three commercial phantoms, no well-defined pattern of change in either automatic or visual IQ scores was observed with increasing IAK. This was rather unexpected, given the general principle that an increase of IAK is expected to reduce noise and therefore increase the IQ score. Although spatial resolution is mostly affected by dexel size and processing protocol and not by the IAK, for any given protocol, the detectability of details with various sizes and contrasts should improve with the increase of IAK. In contrast to the commercial phantoms, for the IAEA phantom, the increase of IAK improved the SDNR and detectability values but had a rather unexpected negative effect on SNR.
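The expectation that a higher IAK reduces noise follows from Poisson statistics: for an idealized quantum-limited detector, the signal scales with the number of detected photons N and the noise with the square root of N, so SNR and SDNR scale with the square root of IAK. The sketch below applies this to the IAK levels used in this study; the function name is hypothetical, and the model deliberately ignores electronic noise, structured noise, and post-processing, which is exactly where clinical-mode images may deviate.

```python
import math


def expected_relative_sdnr(iak_ugy, reference_iak_ugy=2.5):
    """Expected SDNR relative to the reference IAK, assuming SDNR ~ sqrt(IAK).

    Assumption: idealized quantum-limited detector with a linear response
    and no image post-processing.
    """
    if iak_ugy <= 0 or reference_iak_ugy <= 0:
        raise ValueError("IAK values must be positive")
    return math.sqrt(iak_ugy / reference_iak_ugy)


# IAK levels (in µGy) set directly on the AEC in this study.
for iak in (1.25, 2.5, 5.0):
    print(iak, round(expected_relative_sdnr(iak), 3))
```

Under these assumptions, halving or doubling the IAK changes the SDNR by a factor of about 0.71 or 1.41, respectively, a change large enough that a sensitive phantom and software combination would be expected to register it.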
Similarly, applying a higher kVp, which is preferable from a patient dose perspective, is expected to decrease radiation contrast and therefore detectability. Such a clear trend was not observed with any of the commercial phantoms regarding the automatically obtained IQ metrics, though for CDR a strong negative trend was observed in all visual scores. For Primus, instead, a moderate positive trend in SNR with kVp increase was observed, which agreed with the stronger positive trend observed in the SNR values for the IAEA phantom with kVp increase. In contrast, strong negative trends with kVp increase were indeed observed with the IAEA phantom for the SDNR and the detectability. Furthermore, the increase in the number of steps visually resolved with the increase of kVp using the Primus phantom was also noteworthy, though this could not be automatically detected by the respective software. It must be noted that the inability to automatically determine the number of discernible steps, the number of discernible small details within the step wedge, and the number of discernible large low-contrast details in the background should be considered a major disadvantage of the software accompanying the Primus phantom.
The results of this study showed that the increase of additional filtration, which is also preferable from a patient dose perspective, did not produce any strong negative trend in IQ in any of the four phantoms. This finding indicates that the use of additional filtration should be encouraged, to reduce the dose to the patient. The use of 0.1- and 0.2-mm Cu reduces IAK by approximately 30% and 50%, respectively. However, this needs further investigation and validation using clinical images.
Regarding the effect of the post-processing algorithms incorporated in every acquisition protocol, it must be noted that these are optimized for clinical and not for phantom images. Therefore, for scoring IQ using phantoms, especially the IAEA phantom, it may be more suitable to use raw (for-processing) images. 11,13 However, obtaining raw images is not always easy (depending on the X-ray unit manufacturer, it may require access to the service mode); therefore, the second-best option is to use the same examination protocol for all subsequent QC tests, so that results are comparable with previous ones. In this study, a raw image of the IAEA phantom was obtained after removing all post-processing from an image originally acquired with the Abdomen protocol. When this image was scored using ATIA, MTF was approximately halved, SDNR and detectability values did not significantly change, but SNR increased from 40 to about 1000. Though it would be interesting to investigate the effect of the exposure parameters on the raw images as well, this study was limited to the IQ of the images acquired using the clinical mode.
This study has some additional limitations, as only four phantoms and one X-ray unit were used. The results of this study cannot be generalized to other phantoms and their software, nor to other digital X-ray units from other manufacturers, which may use image receptors of different technologies and different pre- and post-processing algorithms. Another problem that was not addressed in this study and applies to visual evaluation only is the possible effect of the monitor screen on the scores (conventional computer screen vs. acquisition workstation monitor or vs. PACS workstation monitor). Finally, this study did not address the relevance of IQ scores obtained with phantoms to the IQ of clinical images.

CONCLUSION
The study showed that the three commercial phantoms and their related software are not sensitive enough to detect subtle changes in IQ with automatic scoring methods, as they failed to exhibit a clear trend of IQ improvement with increased dose and IQ deterioration with increased tube potential. On the contrary, using the IAEA phantom and methodology, the SDNR and detectability scores followed the theoretically expected trends. However, the opposite trends observed for SDNR and SNR with increasing IAK and kVp cannot be explained with certainty. It seems that the increase of transmission with kVp increases the signal and reduces the noise (i.e., increases the SNR), but the reduction of radiation contrast entails that the signal differences become smaller, which is why SDNR and detectability decrease with kVp increase. As detectability is considered to be a more suitable IQ metric than SNR, the detectability and therefore the SDNR trend should be the correct one. More work is needed to evaluate in more detail whether conventional QC phantoms are appropriate for use with clinical examination protocols, and to what extent the results of evaluations with such phantoms (like the observed minimal effect of additional filtration on IQ) are applicable to clinical images. As mentioned above, these protocols are designed to accentuate minor anatomical differences, and the effect of dose and beam quality on IQ may be concealed by the stronger effect of the processing protocol on IQ. What was made evident in this study is that the diversity of the post-processing protocols and relevant adjustments is too large, and it is therefore practically impossible to evaluate all of them, even during the commissioning phase. Therefore, for long-term monitoring of IQ, a single examination protocol used for a frequent clinical examination, like Abdomen AP or Chest PA, should be used.

ACKNOWLEDGMENTS
The construction of the IAEA phantom was funded by the International Atomic Energy Agency (IAEA) in the context of the Coordinated Research Project E24025 entitled "Advanced Tools for Quality and Dosimetry of Digital Imaging in Radiology"; https://www.iaea.org/projects/crp/e24025. Open Access funding provided by the Qatar National Library.

CONFLICT OF INTEREST
The authors have no conflict of interest to state.

AUTHOR CONTRIBUTIONS
All authors substantially contributed to the conception or design of the research and analyzed/interpreted results. Moreover, all authors contributed to drafting the manuscript or revising it critically for important intellectual content.